Indexing by Latent Dirichlet Allocation and an Ensemble Model

Authors

  • Yanshan Wang
  • Jae-Sung Lee
  • In-Chan Choi
Abstract

The contribution of this paper is two-fold. First, we present indexing by Latent Dirichlet Allocation (LDI), an automatic document indexing method with a probabilistic concept search. The probability distributions in LDI utilize those in Latent Dirichlet Allocation (LDA), a generative topic model that has previously been applied to document indexing tasks. However, those ad hoc applications, or their variants with smoothing techniques as prompted by previous studies in LDA-based language modeling, would result in unsatisfactory performance because the terms in documents may not properly reflect the concept space. To improve performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on it. Second, we propose an ensemble model (EnM) for document indexing. The EnM combines basis indexing models by assigning them different weights and seeks the optimal weights, those that maximize the mean average precision (MAP). To solve this optimization problem, we propose three algorithms: EnM.B, EnM.CD, and EnM.PCD. EnM.B is derived from the boosting method, EnM.CD from the coordinate descent method, and EnM.PCD from the parallel property of EnM.CD. The results of our computational experiment on a benchmark data set indicate that both proposed approaches are viable options for document indexing tasks. © 2013 Published by Elsevier Ltd.
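The idea of weighting basis indexing models to maximize MAP can be illustrated with a small sketch. The abstract does not give the details of EnM.B, EnM.CD, or EnM.PCD, so the code below is not the authors' algorithm: it assumes a simple coordinate-wise grid search in which one weight is varied at a time while the others stay fixed. The function names, the weight grid, and the data layout (`model_scores[m][q][d]`) are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Set, Tuple


def average_precision(scores: Sequence[float], relevant: Set[int]) -> float:
    """Average precision for one query; scores[d] is the score of document d."""
    # Rank documents by descending score (ties keep their index order).
    ranking = sorted(range(len(scores)), key=lambda d: -scores[d])
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(weights, model_scores, relevance) -> float:
    """MAP of the weighted sum of basis-model scores.

    model_scores[m][q][d] is the score basis model m gives document d for query q.
    relevance[q] is the set of relevant document ids for query q.
    """
    aps = []
    for q, relevant in enumerate(relevance):
        n_docs = len(model_scores[0][q])
        combined = [
            sum(w * model_scores[m][q][d] for m, w in enumerate(weights))
            for d in range(n_docs)
        ]
        aps.append(average_precision(combined, relevant))
    return sum(aps) / len(aps)


def coordinate_search(
    model_scores,
    relevance,
    grid: Optional[List[float]] = None,
    sweeps: int = 5,
) -> Tuple[List[float], float]:
    """Assumed coordinate-wise search: vary one weight at a time over a grid."""
    if grid is None:
        grid = [i / 10 for i in range(11)]  # hypothetical grid 0.0, 0.1, ..., 1.0
    n_models = len(model_scores)
    weights = [1.0 / n_models] * n_models  # start from uniform weights
    best = mean_average_precision(weights, model_scores, relevance)
    for _ in range(sweeps):
        improved = False
        for m in range(n_models):
            for w in grid:
                trial = weights[:m] + [w] + weights[m + 1:]
                score = mean_average_precision(trial, model_scores, relevance)
                if score > best:  # keep a change only if MAP strictly improves
                    best, weights, improved = score, trial, True
        if not improved:
            break  # a full sweep changed nothing, so stop early
    return weights, best
```

Because MAP depends on the weights only through the induced ranking, it is piecewise constant, which is why this sketch evaluates candidate weights directly rather than following a gradient. Each coordinate's candidates can be scored independently of one another, which is one plausible reading of the "parallel property" mentioned above, though the abstract does not spell it out.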


Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...


Confidence measure for speech indexing based on Latent Dirichlet Allocation

This paper presents a confidence measure for speech indexing that aims to predict the indexing quality of a speech document for a Spoken Document Retrieval (SDR) task. We first introduce how the indexing quality of a speech document is evaluated. Then, we present our method to predict the indexing quality of a speech document. It is based on confidence measure provided by an automatic speech re...


Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine

This paper documents the systems that we developed for our participation in the BioASQ 2014 large-scale bio-medical semantic indexing and question answering challenge. For the large-scale semantic indexing task, we employed a novel multi-label ensemble method consisting of support vector machines, labeled Latent Dirichlet Allocation models and meta-models predicting the number of relevant label...


Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices

In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with other simpler document models, e.g. probabilistic latent semantic indexing, etc. T...


Latent Dirichlet Allocation

We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where t...



Journal:
  • JASIST

Volume 67  Issue 

Pages  -

Publication date: 2016